In this first section, we generate data with the provided generator and use it as a test case for SVMs. We implement both the primal and the dual formulation in AMPL.
We generate the data using gensvmdat, using our two student identifier numbers as the seeds for the training and test sets, e.g.:
./gensvmdat tes_raw.dat 100 4624 # te = test, s = small -> tes
./gensvmdat trl_raw.dat 100000 7438042 # tr = train, l = large -> trl
We create 4 files in total: small sets of 100 points and large ones of 100,000 points, using the same sizes for training and testing. After creating the raw data files, we run a shell script that processes them so they can be loaded into AMPL. This consists of adding a header defining the parameters m and n, removing the * symbols (which mark misclassified points), and printing the number of misclassifications to the terminal. Finally, the script copies the processed data files to the primal/ and dual/ directories.
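For reference, the processed files follow the usual AMPL data layout. The snippet below is a hypothetical, down-sized illustration: it declares only two points, and all numeric values are made up.

# Hypothetical layout of a processed data file such as trs.dat
# (toy example with m = 2; the real files contain 100 or 100000 rows).
param m := 2;
param n := 4;

param A : 1        2        3        4       :=
  1       0.910464 0.836721 0.328787 0.144846
  2       0.600910 0.225021 0.071325 0.551683 ;

param y :=
  1   1
  2  -1 ;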
We aim to solve the following optimization problem:
\[ \min_{w, \gamma, s} \frac{1}{2} w^T w + \nu e^T s \]
subject to: \[ Y(Aw + \gamma e) + s \geq e, \] \[ s \geq 0, \]
where \(w \in \mathbb{R}^n\) is the weight vector, \(\gamma\) is the bias term, \(s \in \mathbb{R}^m\) are the slack variables, \(\nu > 0\) is the penalty parameter, \(e\) is the vector of ones, \(A \in \mathbb{R}^{m \times n}\) is the matrix of data points (one row per point), and \(Y = \operatorname{diag}(y)\) is the diagonal matrix of labels \(y_i \in \{-1, +1\}\).
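For concreteness, the sketch below shows a minimal primal.mod consistent with this formulation. Parameter and variable names (m, n, nu, A, y, w, gamma, s) match those used in the runs later in this section, but our actual model file may differ in layout.

# primal.mod -- minimal sketch of the primal soft-margin SVM
param m;                      # number of training points
param n;                      # number of features
param nu >= 0;                # penalty parameter, set via "let nu := ...;"
param A {1..m, 1..n};         # training points, one row per point
param y {1..m};               # labels in {-1, +1}

var w {1..n};                 # weight vector
var gamma;                    # bias term
var s {1..m} >= 0;            # slack variables

minimize obj:
    0.5 * sum {j in 1..n} w[j]^2 + nu * sum {i in 1..m} s[i];

subject to margin {i in 1..m}:
    y[i] * ((sum {j in 1..n} A[i,j] * w[j]) + gamma) + s[i] >= 1;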
We formulate the dual SVM model using explicit indices instead of matrix notation.
The objective function is \[ \max_{\lambda} \quad \sum_{i=1}^m \lambda_i - \frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m \lambda_i \lambda_j y_i y_j \left( \sum_{k=1}^n A_{ik} A_{jk} \right) \]
subject to: 1. \[ \sum_{i=1}^m \lambda_i y_i = 0 \] 2. \[ 0 \leq \lambda_i \leq \nu, \quad \forall i = 1, \dots, m \]
where \(\lambda_i\) is the dual variable associated with training point \(i\), \(y_i \in \{-1, +1\}\) is its label, \(A_{ik}\) is its \(k\)-th feature, and \(\nu\) is the same penalty parameter as in the primal.
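Analogously, a minimal dual.mod consistent with this formulation might look as follows (again, our actual model file may differ in layout):

# dual.mod -- minimal sketch of the dual soft-margin SVM with explicit indices
param m;                      # number of training points
param n;                      # number of features
param nu >= 0;                # penalty parameter
param A {1..m, 1..n};         # training points
param y {1..m};               # labels in {-1, +1}

var lambda {1..m} >= 0, <= nu;   # one box-constrained dual variable per point

maximize obj:
    sum {i in 1..m} lambda[i]
  - 0.5 * sum {i in 1..m, j in 1..m}
        lambda[i] * lambda[j] * y[i] * y[j]
        * (sum {k in 1..n} A[i,k] * A[j,k]);

subject to balance:
    sum {i in 1..m} lambda[i] * y[i] = 0;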
The separation hyperplane in a Support Vector Machine (SVM) is determined by \(\mathbf{w}\) (the weight vector) and \(\gamma\) (the bias term). It defines the decision boundary that separates the two classes. The equation of the hyperplane is: \[ \sum_{j=1}^n w_j x_j + \gamma = 0 \]
where: - \(\mathbf{w} = [w_1, w_2, \dots, w_n]\) is the weight vector, - \(\gamma\) is the bias term, - \(\mathbf{x} = [x_1, x_2, \dots, x_n]\) represents the coordinates of a point in \(n\)-dimensional space.
The decision rule for classification is: \[ f(\mathbf{x}) = \text{sign} \left( \sum_{j=1}^n w_j x_j + \gamma \right) \]
Since the separation hyperplane depends only on \(w\) and \(\gamma\), if these coincide for the primal and the dual, the hyperplane coincides as well. To find the values of \(w\) and \(\gamma\) from the dual solution, we use these formulas: \[
w = \sum_{i=1}^m \lambda_i y_i \phi (x_i)
\] where \(\phi\) is the feature map, which in this linear case is simply the identity. \[
\gamma = y_k - \sum_{j=1}^n w_j \cdot A_{k,j}
\] where \(k\) is the index of the first support vector found.
We identify support vectors by the property \(0 < \lambda_k < \nu\) for all \(k \in SV\).
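The formula for \(\gamma\) follows from complementary slackness: for such a support vector the corresponding margin constraint is active with zero slack, so \[
y_k \Big( \sum_{j=1}^n w_j A_{k,j} + \gamma \Big) = 1
\quad\Longrightarrow\quad
\gamma = \frac{1}{y_k} - \sum_{j=1}^n w_j A_{k,j} = y_k - \sum_{j=1}^n w_j A_{k,j},
\] since \(y_k \in \{-1, +1\}\) implies \(1/y_k = y_k\).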
cd /Users/danilakokin/Desktop/UPC/Semester3/OTDM/OTDM_Project_2/primal
/Users/danilakokin/Downloads/ampl_macos64/ampl <<EOF
option solver cplex;
model primal.mod;
data trs.dat;
let nu := 0.9;
solve;
display n, gamma, w;
display n, gamma, w > sparams.dat;
quit;
EOF
## CPLEX 22.1.1.0: optimal solution; objective 45.1830374
## 11 separable QP barrier iterations
## No basis.
## n = 4
## gamma = -3.65763
##
## w [*] :=
## 1 1.28474
## 2 1.97146
## 3 2.35625
## 4 2.05919
## ;
cd /Users/danilakokin/Desktop/UPC/Semester3/OTDM/OTDM_Project_2/primal
/Users/danilakokin/Downloads/ampl_macos64/ampl <<EOF
option solver cplex;
model primal.mod;
data trl.dat;
let nu := 0.9;
solve;
display gamma, w;
display n, gamma, w > lparams.dat;
quit;
EOF
## CPLEX 22.1.1.0: optimal solution; objective 26542.77347
## 14 separable QP barrier iterations
## No basis.
## gamma = -10.1811
##
## w [*] :=
## 1 5.0655
## 2 5.09634
## 3 5.12616
## 4 5.08233
## ;
To calculate \(w^*\) and \(\gamma\) we use the aforementioned formulas, which we implement with for loops in AMPL.
cd /Users/danilakokin/Desktop/UPC/Semester3/OTDM/OTDM_Project_2/dual
/Users/danilakokin/Downloads/ampl_macos64/ampl <<EOF
option solver cplex;
model dual.mod;
data trs.dat;
let nu := 0.9;
solve;
# Recover w from the dual solution: w = sum_i lambda_i * y_i * x_i
param w {j in 1..n};
for {j in 1..n} {
let w[j] := sum {i in 1..m} lambda[i] * y[i] * A[i, j];
}
# Locate the first support vector: 0 < lambda_i < nu
param svi default 0;
for {i in 1..m} {
if lambda[i] > 1e-5 and lambda[i] < nu then {
let svi := i;
break;
}
}
check: svi > 0;
# Recover gamma from the active margin constraint of the support vector
param gamma := y[svi] - sum {j in 1..n} w[j] * A[svi, j];
display gamma, w;
display n, gamma, w > sparams.dat;
quit;
EOF
## CPLEX 22.1.1.0: optimal solution; objective 45.18303736
## 12 QP barrier iterations
## No basis.
## gamma = -3.65763
##
## w [*] :=
## 1 1.28474
## 2 1.97146
## 3 2.35625
## 4 2.05919
## ;
As can be seen, the values of the objective function, \(w^*\) and \(\gamma\) are identical between the primal and the dual on the small dataset: \(w^*\) and \(\gamma\) match to at least five decimals, and the objective value to at least eight significant digits, which means both formulations found exactly the same optimal hyperplane. This is consistent with theory: strong duality holds for this convex quadratic program, so the dual attains the same optimal value as the primal. Any remaining difference is explained by numerical inaccuracies in the implementation, for example in the solver. We use the default solver (CPLEX) since it seemed most appropriate and is computationally efficient, although we did look into alternatives.
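The evaluation model eval.mod is not reproduced in this report. The sketch below shows one way such a model could compute the reported metrics directly from the test data and the saved parameters, assuming the names used elsewhere (m, n, A, y, w, gamma); our actual eval.mod may differ.

# eval.mod -- minimal sketch of an evaluation model (assumed layout).
# No solve is needed: everything is computed from data and parameters.
param m;                      # number of test points
param n;                      # number of features
param A {1..m, 1..n};         # test points
param y {1..m};               # true labels in {-1, +1}
param w {1..n};               # trained weights (from *params.dat)
param gamma;                  # trained bias   (from *params.dat)

# predicted label: sign(w'x + gamma)
param pred {i in 1..m} :=
    if (sum {j in 1..n} w[j] * A[i,j]) + gamma >= 0 then 1 else -1;

param true_positive  := card {i in 1..m: pred[i] =  1 and y[i] =  1};
param true_negative  := card {i in 1..m: pred[i] = -1 and y[i] = -1};
param false_positive := card {i in 1..m: pred[i] =  1 and y[i] = -1};
param false_negative := card {i in 1..m: pred[i] = -1 and y[i] =  1};

param accuracy  := (true_positive + true_negative) / m;
param precision := true_positive / (true_positive + false_positive);
param recall    := true_positive / (true_positive + false_negative);
param f1_score  := 2 * precision * recall / (precision + recall);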
cd /Users/danilakokin/Desktop/UPC/Semester3/OTDM/OTDM_Project_2/primal
/Users/danilakokin/Downloads/ampl_macos64/ampl <<EOF
option solver cplex;
model eval.mod;
data tes.dat;
data sparamsformated.dat;
display accuracy, precision, recall, f1_score;
quit;
EOF
## accuracy = 0.88
## precision = 0.842105
## recall = 0.941176
## f1_score = 0.888889
cd /Users/danilakokin/Desktop/UPC/Semester3/OTDM/OTDM_Project_2/primal
/Users/danilakokin/Downloads/ampl_macos64/ampl <<EOF
option solver cplex;
model eval.mod;
data tel.dat;
data lparamsformated.dat;
display accuracy, precision, recall, f1_score;
quit;
EOF
## accuracy = 0.94773
## precision = 0.946665
## recall = 0.948955
## f1_score = 0.947809
cd /Users/danilakokin/Desktop/UPC/Semester3/OTDM/OTDM_Project_2/dual
/Users/danilakokin/Downloads/ampl_macos64/ampl <<EOF
option solver cplex;
model eval.mod;
data tes.dat;
data sparamsformated.dat;
display accuracy, precision, recall, f1_score;
quit;
EOF
## accuracy = 0.88
## precision = 0.842105
## recall = 0.941176
## f1_score = 0.888889
As observed, both the primal and the dual SVM produce identical results, achieving an accuracy of 88% and 95% for the small and large datasets, respectively. This aligns with theoretical expectations: on large samples, roughly 5% of the generated points are mislabeled, so about 95% is the best accuracy one can hope for. We do not show the dual applied to the large dataset because of computational limitations, but the small-dataset comparison above shows that the dual recovers exactly the same hyperplane as the primal.
As real-world data for our SVMs we selected the Banknote Authentication dataset (https://archive.ics.uci.edu/dataset/267/banknote+authentication). Images of genuine and forged banknotes were digitized using an industrial camera (400x400 pixels, 660 dpi). There are 4 features, extracted with a wavelet transform, and 1 target:
- variance of the wavelet-transformed image (continuous)
- skewness of the wavelet-transformed image (continuous)
- kurtosis of the wavelet-transformed image (continuous)
- entropy of the image (continuous)
- class (binary target: 0 for authentic, 1 for inauthentic)
# Load the data
df <- read.csv("/Users/danilakokin/Desktop/UPC/Semester3/OTDM/OTDM_Project_2/data_formatter/money.csv")
barplot_heights <- table(df$class)
barplot_positions <- barplot(barplot_heights,
main = "Bar Plot of Class (target)",
xlab = "Class",
ylab = "Frequency",
col = c("red", "blue"),
ylim = c(0, max(barplot_heights) + 10))
text(x = barplot_positions, y = barplot_heights,
labels = barplot_heights, pos = 3, cex = 0.8)
legend("topright", legend = c("Class 0: authentic ", "Class 1: inauthentic"),
fill = c("red", "blue"), title = "Class")
The target classes are balanced; however, some preprocessing is still required: the id column must be removed, the dataset split into equal train and test sets, the features normalized, and the target values converted from {0, 1} to {-1, 1}.
# Convert 'class' column to binary variable (1 and -1)
df$class <- ifelse(df$class == 1, 1, -1)
df$id <- NULL
# Split the data into train and test sets
set.seed(123)
trainll <- sample(1:nrow(df), size = nrow(df) / 2)
testll <- setdiff(1:nrow(df), trainll)
train <- df[trainll, ]
test <- df[testll, ]
numeric_cols <- names(train)[sapply(train, is.numeric) & names(train) != "class"]
# Compute min and max values for each numeric column in the training set
min_vals <- sapply(train[, numeric_cols], min, na.rm = TRUE)
max_vals <- sapply(train[, numeric_cols], max, na.rm = TRUE)
# Define normalization function
normalize <- function(x, min_val, max_val) {
(x - min_val) / (max_val - min_val)
}
# Apply normalization to the train set
for (col_name in numeric_cols) {
min_val <- min_vals[col_name]
max_val <- max_vals[col_name]
train[[col_name]] <- normalize(train[[col_name]], min_val, max_val)
}
# Normalize test data (except target) based on the train normalizer
for (col_name in numeric_cols) {
min_val <- min_vals[col_name]
max_val <- max_vals[col_name]
test[[col_name]] <- normalize(test[[col_name]], min_val, max_val)
}
summary(train)
## variance kurtosis skewness entropy
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.3786 1st Qu.:0.4594 1st Qu.:0.1532 1st Qu.:0.5596
## Median :0.5488 Median :0.6009 Median :0.2503 Median :0.7261
## Mean :0.5420 Mean :0.5878 Mean :0.2865 Mean :0.6656
## 3rd Qu.:0.7066 3rd Qu.:0.7695 3rd Qu.:0.3612 3rd Qu.:0.8229
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## class
## Min. :-1.0000
## 1st Qu.:-1.0000
## Median :-1.0000
## Mean :-0.1312
## 3rd Qu.: 1.0000
## Max. : 1.0000
cat("Number of rows in train set:", nrow(train), "\n")
## Number of rows in train set: 686
summary(test)
## variance kurtosis skewness entropy
## Min. :-0.0004112 Min. :0.004258 Min. :-0.003033 Min. :-0.07669
## 1st Qu.: 0.3809555 1st Qu.:0.446418 1st Qu.: 0.163883 1st Qu.: 0.52343
## Median : 0.5317613 Median :0.602478 Median : 0.252592 Median : 0.72085
## Mean : 0.5358386 Mean :0.586773 Mean : 0.285057 Mean : 0.66009
## 3rd Qu.: 0.7150139 3rd Qu.:0.771197 3rd Qu.: 0.367578 3rd Qu.: 0.82126
## Max. : 0.9811344 Max. :0.991715 Max. : 0.986078 Max. : 1.02885
## class
## Min. :-1.00000
## 1st Qu.:-1.00000
## Median :-1.00000
## Mean :-0.09038
## 3rd Qu.: 1.00000
## Max. : 1.00000
cat("Number of rows in test set:", nrow(test), "\n")
## Number of rows in test set: 686
train_file <- "/Users/danilakokin/Desktop/UPC/Semester3/OTDM/OTDM_Project_2/data_formatter/trm.csv"
test_file <- "/Users/danilakokin/Desktop/UPC/Semester3/OTDM/OTDM_Project_2/data_formatter/tem.csv"
write.csv(train, file = train_file, row.names = FALSE)
write.csv(test, file = test_file, row.names = FALSE)
data <- df
features <- colnames(data)[1:4]
target <- colnames(data)[5]
num_features <- length(features)
par(mfrow = c(2, 3)) # Set up a plotting grid
for (i in 1:(num_features - 1)) {
for (j in (i + 1):num_features) {
# Plot each feature pair
plot(data[[features[i]]], data[[features[j]]],
col = ifelse(data[,target] == 1, "blue", "red"),
xlab = features[i], ylab = features[j],
main = paste("Feature Pair:", features[i], "vs", features[j]))
}
}
As observed, no pairwise feature plot shows linearly separable classes.
library(plotly) # provides plot_ly and the %>% pipe
data <- df
features <- colnames(data)[1:4]
target <- colnames(data)[5]
pca_result <- prcomp(data[, features], scale. = TRUE)
cat("Cumulative explained variance\n")
## Cumulative explained variance
summary(pca_result)$importance[3, 1:3] * 100
## PC1 PC2 PC3
## 54.498 86.826 95.611
pca_data <- data.frame(PC1 = pca_result$x[, 1],
PC2 = pca_result$x[, 2],
PC3 = pca_result$x[, 3],
Label = data[[target]])
plot <- plot_ly(
data = pca_data,
x = ~PC1, y = ~PC2, z = ~PC3,
color = ~Label,
colors = c("red", "blue"),
type = "scatter3d",
mode = "markers",
marker = list(size = 2)
) %>%
layout(
scene = list(
xaxis = list(title = "PC 1"),
yaxis = list(title = "PC 2"),
zaxis = list(title = "PC 3")
),
title = "3D PCA Plot"
)
plot
As a final demonstration, we plot a 3D visualization of the first three principal components, which together explain approximately 95% of the variance. The plot clearly shows that the classes are not linearly separable, which makes this dataset a meaningful, non-trivial test case for soft-margin SVM models.
cd /Users/danilakokin/Desktop/UPC/Semester3/OTDM/OTDM_Project_2/primal
/Users/danilakokin/Downloads/ampl_macos64/ampl <<EOF
option solver cplex;
model primal.mod;
data trm.dat;
let nu := 0.9;
solve;
display n, gamma, w;
display n, gamma, w > mparams.dat;
quit;
EOF
## CPLEX 22.1.1.0: optimal solution; objective 106.7440306
## 19 separable QP barrier iterations
## No basis.
## n = 4
## gamma = 7.59591
##
## w [*] :=
## 1 -5.98142
## 2 -5.26369
## 3 -5.58756
## 4 0.132916
## ;
cd /Users/danilakokin/Desktop/UPC/Semester3/OTDM/OTDM_Project_2/primal
/Users/danilakokin/Downloads/ampl_macos64/ampl <<EOF
option solver cplex;
model eval.mod;
data tem.dat;
data mparamsformated.dat;
display true_positive, true_negative, false_positive, false_negative;
display accuracy, precision, recall, f1_score;
quit;
EOF
## true_positive = 312
## true_negative = 360
## false_positive = 14
## false_negative = 0
##
## accuracy = 0.979592
## precision = 0.957055
## recall = 1
## f1_score = 0.978056
cd /Users/danilakokin/Desktop/UPC/Semester3/OTDM/OTDM_Project_2/dual
/Users/danilakokin/Downloads/ampl_macos64/ampl <<EOF
option solver cplex;
model dual.mod;
data trm.dat;
let nu := 0.9;
solve;
param w {j in 1..n};
for {j in 1..n} {
let w[j] := sum {i in 1..m} lambda[i] * y[i] * A[i, j];
}
param svi default 0;
for {i in 1..m} {
if (lambda[i] > 1e-6) and (lambda[i] + 1e-7 < nu) then { # tolerances guard against lambda sitting at a bound up to round-off
let svi := i;
break;
}
}
check: svi > 0;
display svi;
param gamma := y[svi] - sum {j in 1..n} w[j] * A[svi, j];
display gamma, w;
display n, gamma, w > mparams.dat;
quit;
EOF
## CPLEX 22.1.1.0: optimal solution; objective 106.7440305
## 21 QP barrier iterations
## No basis.
## svi = 158
##
## gamma = 7.59591
##
## w [*] :=
## 1 -5.98142
## 2 -5.26369
## 3 -5.58756
## 4 0.132916
## ;
cd /Users/danilakokin/Desktop/UPC/Semester3/OTDM/OTDM_Project_2/dual
/Users/danilakokin/Downloads/ampl_macos64/ampl <<EOF
option solver cplex;
model eval.mod;
data tem.dat;
data mparamsformated.dat;
display true_positive, true_negative, false_positive, false_negative;
display accuracy, precision, recall, f1_score;
quit;
EOF
## true_positive = 312
## true_negative = 360
## false_positive = 14
## false_negative = 0
##
## accuracy = 0.979592
## precision = 0.957055
## recall = 1
## f1_score = 0.978056
As we can see, both the primal and the dual SVM predict the banknote class with an accuracy of 98%, which is an outstanding result. They produce the same predictions on the test set, so the confusion matrices are identical, and the \(w^*\) and \(\gamma\) values agree as far as we can tell, i.e. up to five decimals. To obtain this result we had to be careful in the search for a support-vector index: because of solver round-off, the naive test \(\lambda_i < \nu\) can accept a point whose \(\lambda_i\) equals \(\nu\) up to numerical tolerance, i.e. a point with non-zero slack. We therefore code the test as \(\lambda_i + 10^{-7} < \nu\), which rejects such points.